Abstract

Here we describe work on learning the 
subcategories of verbs in a morphologi- 
cally rich language using only minimal 
linguistic resources. Our goal is to learn 
verb subcategorizations for Quechua, an 
under-resourced morphologically rich lan- 
guage, from an unannotated corpus. We 
compare results from applying this ap- 
proach to an unannotated Arabic corpus 
with those achieved by processing the 
same text in treebank form. The origi- 
nal plan was to use only a morphologi- 
cal analyzer and an unannotated corpus, 
but experiments suggest that this approach 
by itself will not be effective for learn- 
ing the combinatorial potential of Arabic 
verbs in general. The lower bound on re- 
sources for acquiring this information is 
somewhat higher, apparently requiring a
part-of-speech tagger and chunker for
most languages, and a morphological dis-
ambiguator for Arabic.

1 Introduction 

When constructing NLP systems for a new lan- 
guage, we often want to know the valence of 
its verbs, which is to say how many and which 
types of arguments each verb may combine with. 
Some dictionaries may provide this information, 
but even assuming a broad-coverage machine-
readable dictionary exists for a given language, that
dictionary may not say whether arguments are op- 
tional for a given verb, or how likely they are to 
occur. 

Knowing the selectional preferences and re- 
quirements of verbs is useful for systems that have 
explicit lexicalized grammars of the languages 
they cover, whether for parsing, generation, or
both (Briscoe and Carroll, 1997), and of linguis-
tic interest on its own (Gahl et al., 2004). The aim
of this work was to build resources for use in L3
(Gasser, 2011), a rule-based machine translation
system based on dependency grammars, which
records the combinatorial possibilities for every 
word in its lexicon, and during parsing and gen- 
eration constructs a graph describing the struc- 
ture of the input and output sentences. We are 
particularly interested in linguistic resources for 
Quechua, which is spoken by roughly 10 million 
people in the Andean region of South America, 
and is thus the largest indigenous language of the 
Americas. However, evaluating the approach for 
Quechua is difficult, due to a lack of existing lex- 
ica and treebanks, so initial experiments have been 
carried out with Arabic. 

An empirical approach based on a corpus or 
treebank allows us to learn the relative frequency 
with which a given verb takes specific types of ar- 
guments. As a simple example from English, we 
would like to be able to learn that while "eat" usu- 
ally has a direct object, "put" nearly always has 
one. Verbs may also occur with clausal depen-
dents in various ways. For some examples in En-
glish, see Figure 1.

In order to automatically learn this informa- 
tion for resource-scarce, morphologically rich lan- 
guages, we set out to implement a system that 
requires only an unannotated corpus and a mor- 
phological analyzer; other recent approaches have 
made use of more linguistic knowledge, in the 
form of treebanks, parsers, or chunkers. In prac-
tice, our approach will also require more resources to be
fruitful; this may be addressed in the future.

2 Related work 

Many other researchers have addressed the prob- 
lem of documenting the properties of the differ- 
ent verbs in a given language, using evidence from 
corpora and manual lexicography. Automatic ap-
proaches have the potential to involve less manual
work and avoid human biases, giving a more objective
measure of the behavior of a given verb.

* I put.                            I believe that he is tall.
* I put the potato.                 I consider him tall.
  I put the potato on the table.  * I consider that he is tall.

Figure 1: "What did you do yesterday?", and believe vs. consider

We see in the literature a few different terms 
that describe the combinatorial potential of a 
verb, including subcategorization, subcategoriza- 
tion frames, valence or valency. In any case, these 
terms describe which arguments and adjuncts may 
appear with a given verb, how often, and which 
ones are obligatory. While describing similar no-
tions, these terms do not seem to be interchange-
able; while this work is concerned with "sur-
face level" syntax, looking for arguments that
are present in practice (such that a parser could 
find them), in Functional Generative Description 
(FGD), the term valency refers to a tectogram- 
matical notion; arguments might be known to the 
speaker but not expressed. This deeper notion can-
not be readily observed from text alone, as pointed
out by Bojar (2003).



2.1 Valency lexica for English 



Gahl et al. (2004) describe a study in which they
had a team of linguists annotate thousands of En- 
glish sentences - 200 sentences for each of 281 
English verbs of interest - and build a table of 
distributions of subcategorization frames that they 
observed for each of the verbs. They describe the 
difficulties that may be faced in trying to learn sub- 
categorizations from a corpus: in a given body of 
text (even one as big as the Brown corpus), it may 
be that not all possible subcategorizations will be 
observed. Additionally, different genres of text 
may exhibit different verb usage. This paper also 
gives a good overview of the uses of valency in- 
formation and a view on verb subcategorization 
from psycholinguistics, including elicitation ex- 
periments that psycholinguists have used to learn 
the relative frequencies of different uses of verbs. 

Gahl's group has made their results available 
in machine-readable form, providing a potentially 
useful resource for those interested in English 
verbs. However, their approach was very labor- 
intensive and required a large corpus. 

Ushioda et al. (1993) describe earlier work
on acquiring verb subcategorizations for English.
Their method requires a tagged corpus, although
an untagged corpus and an accurate tagger would
work as well. On the basis of the tags, they 
perform partial parsing to identify noun phrases 
(chunking), and then use some simple rules spec- 
ified in terms of regular grammars to identify 
common patterns of constituents in the sentences, 
which are marked with corresponding subcatego- 
rization frames. This approach does not require 
the use of a deep parser, but the rules had to be 
crafted specifically for English. 

Ushioda et al. explore the WSJ corpus with this ex-
tractor, and report results on 33 randomly se-
lected common verbs: the extraction rules achieve
86% accuracy over sentences from a test set taken 
from WSJ, where the correct subcategorization 
frames for the test set had been determined manu- 
ally. 



Brent (1993) addresses the seeming impasse
that in order to get accurate parses automatically, 
one needs to know about the syntactic frames of 
different verbs, but in order to get the frame in- 
formation from a corpus, the sentences must be 
parsed accurately. He handles this problem by 
crafting language-specific rules that initially only 
refer to closed-class words and do not require 
complete parses. This approach is somewhere be- 
tween having no syntactic knowledge at all, and 
requiring a large grammar of the target language: 
to start out, it must first figure out which words in 
the corpus are verbs. He then uses statistics to in- 
fer previously unknown facts about the language, 
for example, which English verbs can occur with 
each of six different kinds of arguments. 

While this appears to be an effective approach, 
one wonders how hard it would be to apply to an 
unfamiliar language. Producing the initial rules 
may require a lot of linguistic insight; for example, 
Brent relies on the fact that in English, verbs typi- 
cally do not appear immediately after determiners 
or prepositions. In a language with more free word 
order, or one without determiners or prepositions, 
what sorts of rules might one use? 



Briscoe and Carroll (1997) describe a system 
that finds subcategorization frames for verbs in 
English, including relative frequencies for each
class for a given verb. They adopt a very detailed
scheme for verb subcategories, in which each us- 
age falls into one of 160 different classes, where 
each class includes specific information about par- 
ticles and control of the arguments of the verb. 
Their system requires the use of a POS tagger, a 
lemmatizer, and a pre-trained probabilistic parser; 
after identifying and classifying the different sub-
categories of verb usages, they incorporate this in-
formation into another parser and demonstrate that
it improves parsing accuracy. 

2.2 Valency lexica for Slavic languages 



The VALLEX project (Zabokrtsky and Lopatkova,
2007) has produced a large hand-curated database
of valency frames for verbs in Czech, covering
roughly the 2500 most common verbs in Czech 
and cataloging their various senses. VALLEX 
makes use of Functional Generative Description 
as the background linguistic theory for its ac- 
count of verbs, and so records, at least, whether 
a verb sense takes an actor, addressee, patient, ef- 
fect, and origin, and whether these must be spec- 
ified, as well as a large number of other "quasi- 
valency complementations" and "free modifica- 
tions". VALLEX provides a very detailed account 
of the potential uses of each verb in its lexicon, 
much more detailed than what can currently be 
produced with automatic methods. 

More recently, Przepiorkowski has done work
focusing on Polish, comparing valence dictionar-
ies built with the use of shallow parsing to those
built with deep parsing (2009). Because his shal-
low parser may not handle all of the sentences in 
the corpus, his approach ends up ignoring more 
than half of the training data, but from the remain-
ing 41% of the IPI PAN Corpus, he collects counts
of the different frames in which each verb was ob- 
served, and uses a small number of Polish-specific 
rules to post-process the observations, then does 
statistical filtering to try to reduce noisy observa- 
tions. 

Przepiorkowski evaluates the extracted lexical 
information in two different ways, making use 
of both pre-existing valence dictionaries and sen- 
tences hand-annotated by linguists, finding that his 
shallow-parsing technique actually produces re- 
sults that agree more closely with frames that were 
observed in the texts by linguists than the existing 
valence dictionaries. 



Debowski (2009) presents a procedure for ex-
tracting valence information and frame weights for
Polish that makes use of a non-probabilistic deep
parser and a novel use of EM, which he says is
simpler than the more traditional repeated inside-
outside approach to optimizing weights for a prob-
abilistic grammar. Additionally, in his EM formu-
lation, the weight-optimization problem is convex,
so he can start with uniform prior probabilities and
be guaranteed to get a globally optimum solution.
Debowski also includes an approach for filtering
incorrect frames that were found in the parsed text.

When analyzing his results, Debowski notes
that some of his observed "false positives" de-
scribed valid uses of the verbs in question, but
were not included in the compiled valence dictio-
naries that he used in evaluating his approach.

2.3 Valency Lexica for Arabic

Informed by the Prague Arabic Dependency Tree-
bank and the Functional Generative Description
(FGD) theory of syntax, Bielicky and Smrz (2008)
describe desiderata for a valency lexicon for Ara-
bic. They do not describe the production of such
a lexicon in practice, but lay out a framework for
discussing one, proposing a structure for lexical
entries in the valence dictionary. Their structure is
based on VALLEX, which seems to have a broadly
applicable formalism for describing verbs. They
also describe some tools useful for the task, in-
cluding an FST-based morphological analyzer for
Arabic, and explain FGD's account of verbal argu-
ments/adjuncts.

2.4 Resources for Quechua

Rios et al. (2009) address the more general prob-
lem of acquiring enough linguistic knowledge to
build effective NLP systems for under-resourced 
languages such as Quechua, with a more labor- 
intensive approach. They describe their construc- 
tion of a phrase-aligned treebank for Quechua and 
Spanish, which covers about 200 sentences, with 
text from the Declaration of Human Rights (avail- 
able in many languages, including Spanish and 
Quechua) and the website of La Defensoria del 
Pueblo, a Peruvian government organization that 
advocates for citizens' rights. Aside from the mor-
phological analysis of Quechua, the treebanking
and alignment processes currently require human at-
tention, though this may be partially automated in
the future. 

The treebank so far is small, but it may be in-
creasingly useful for machine translation as their
treebanking process becomes more automated.
Rios et al. note a surprising number of avail- 
able bitexts for Spanish/Quechua, including polit- 
ical texts, news, translated novels and poetry. 

3 Proposed Approach 

Our approach starts by processing each sentence in 
the corpus with the morphological analyzer, thus 
finding all of the verbs. For sentences with only 
one verb, we then count the occurrences of nouns 
that seem to be, because of inflection, the argu- 
ments of the verb. Here plausible verb arguments 
will need to be identified with a small number of 
language-specific heuristics. For example, a noun 
inflected with the accusative case in a sentence 
with a verb and a clear subject will likely be the 
object of that verb. This approach throws away the 
information from sentences with multiple verbs 
and embedded clauses, but it does not require syn-
tactic analysis. We had hoped that the frequen-
cies learned with this approach would approximate
the frequencies that would be learned using deeper
syntactic analysis, but this was not borne out em-
pirically.
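
The counting step described above can be sketched as follows. This is only an illustration under stated assumptions: the token format (dictionaries with "pos", "stem", and "case" fields) and the placeholder stems are hypothetical and do not correspond to the output of any particular morphological analyzer, and the single heuristic used here (collect the cases of all nouns in a one-verb sentence) stands in for whatever language-specific heuristics would be needed in practice.

```python
from collections import Counter, defaultdict

# Hypothetical analyzer output: a sentence is a list of tokens, and each
# token is a dict with a coarse POS, a stem, and (for nouns) a case label.
# These field names are illustrative only, not a real analyzer's format.
def count_frames(sentences):
    """Count case-based frames for verbs in one-verb sentences."""
    frames = defaultdict(Counter)
    for tokens in sentences:
        verbs = [t for t in tokens if t["pos"] == "verb"]
        if len(verbs) != 1:
            continue  # skip sentences with zero or several verbs
        # Treat the set of noun cases in the sentence as the observed frame.
        cases = frozenset(t["case"] for t in tokens
                          if t["pos"] == "noun" and t.get("case"))
        frames[verbs[0]["stem"]][cases] += 1
    return frames

sentences = [
    [{"pos": "noun", "stem": "N1", "case": "nom"},
     {"pos": "verb", "stem": "V1"},
     {"pos": "noun", "stem": "N2", "case": "acc"}],
    [{"pos": "noun", "stem": "N1", "case": "nom"},
     {"pos": "verb", "stem": "V1"}],
    [{"pos": "verb", "stem": "V1"},
     {"pos": "verb", "stem": "V2"}],  # two verbs: ignored
]
frames = count_frames(sentences)
```

Dividing each count by the total for its verb would give the relative frequencies of the observed frames.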

Noisy observations could be filtered out us-
ing an approach similar to the one described by
Przepiorkowski (2009). In the long run, for con-
sistency, we would like to build a lexicon in the
VALLEX style, discovering whether each given 
verb usage contains an explicit Actor, Addressee, 
Patient, Effect, and Origin, when these roles can 
be identified by the morphological cues. 

3.1 Evaluating Valency Learning Techniques 

When building a system that builds valency lexica 
for the verbs of a given language, we would like 
both good recall, meaning that the system iden-
tifies many of the verb usages that are actually
present in the training text, and high precision, 
meaning that the answers the system returns are 
actually correct. To measure both of these, we can 
take some preexisting lexicon to be the gold stan- 
dard, but good valency lexica are not available for 
most languages. 

What we can do instead is take the verb us- 
ages in a treebank, and consider the subcatego- 
rization lexicon constructed in that way to be the 
gold standard. We have an Arabic treebank (Ara-
bic Treebank Part 1, v3.0) available from the LDC
(Maamouri et al., 2005), so for this work we make
use of both that treebank and the associated flat
text. We chose Arabic for its rich morphology, and
for the somewhat convenient, though not freely re-
distributable, treebank. If the results were good for
Arabic, then that would be evidence that the approach might
be helpful for constructing valency lexica for other
languages as well. 
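
As a concrete reading of the evaluation just described, precision and recall can be computed over (verb stem, frame) pairs, with the treebank-derived lexicon as the gold standard. The sketch below is a minimal illustration; representing a lexicon as a set of such pairs, and the placeholder stems and frame labels, are assumptions made for the example.

```python
def precision_recall(predicted, gold):
    """Compare two lexica given as sets of (verb_stem, frame) pairs."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Placeholder entries: "gold" would come from the treebank, "predicted"
# from the unannotated text.
gold = {("V1", "NP"), ("V1", "NP PP"), ("V2", "SBAR")}
predicted = {("V1", "NP"), ("V2", "NP")}
p, r = precision_recall(predicted, gold)
```

Here one of the two predicted entries is in the gold lexicon, and one of the three gold entries is recovered.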

4 Experiments with Arabic 

We carried out experiments with the text of the 
Arabic treebank, using both the transliterated text 
with syntactic annotations and the unannotated 
Unicode text in Arabic script. Given the tree- 
bank annotations, we can find the verbs in each 
sentence, as well as the other components of the 
verb phrases, quite easily by traversing the parse 
trees. For initial experiments, due to the sparsity
of the data, we pass over the problem of deciding
whether a constituent is an argument or adjunct of 
the verb. 

To find all of the verb subcategory frames in the 
treebank, we traverse the tree of each sentence and 
record the immediate children of the verb phrase 
that are not the verb itself. These are considered 
a set, and recorded with the stem of the verb in 
question. The process is described in more detail
in Figure 2.

Considering the entire treebank, which con- 
sists of 734 news documents, there are 5845 sen- 
tences, containing 14115 verb phrases. The ma- 
jority (92%) of these verb phrases have a verb that 
can be found with the rules described. The most 
common verb stems, presented in Buckwalter 
transliteration, were: "kAn", "qAl", ">aEolan",
">aDAf", ">ak~ad", "kuwn", ">awoDaH",
"*akar", "mokin", ">afAd". Each of these oc- 
curred at least 100 times in the corpus. Not all of 
these can be translated sensibly to English with- 
out context by Google Translate, but using it as a 
glossing tool, we get: was, declared, added, con- 
firmed, fact, clear, enabled, and reported. There 
were 1747 different verb stems observed alto- 
gether. 

Adapting the approach of Przepiorkowski 
(2009), we then focus on the sentences from the 
corpus that contain only one verb. This allows 
us to avoid making attachment decisions, since 
deep parsers may not be available or reliable. Bo-
jar (2003) does something similar, with the addi-
tion of a chunker that can find subordinate clause
boundaries. This approach also seems sensible
particularly for Quechua and Arabic, since case
is typically marked on nouns for both languages,
although this still leaves the problem of dropped
arguments.

• For each sentence in the treebank...

  - For each verb phrase in that sentence...

    * Look for a word in the VP with a tag that contains one of IV, PV, IV_PASS or PV_PASS
      (one may not be present; if so, skip this VP)

    * Find the stem for the verb, if present

    * Record the verb stem, along with the tags of the sibling constituents.

Figure 2: Process for finding verbs and arguments in the treebank
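
The treebank traversal of Figure 2 might be sketched as below. The tree encoding is an assumption made for illustration: an internal node is a (label, children) pair and a leaf is a (tag, stem) pair, so the stem is taken to be already available on the leaf, which sidesteps the stem lookup that the real treebank annotations would require. The constituent labels in the example tree are likewise illustrative.

```python
from collections import Counter

VERB_MARKERS = ("IV", "PV")  # substring match also covers IV_PASS, PV_PASS

def is_leaf(node):
    # Assumed encoding: a leaf is (tag, stem); an internal node is
    # (label, [children]).
    return isinstance(node[1], str)

def verb_frames(tree, counts=None):
    """Count (verb stem, set of sibling labels) for every VP in a tree."""
    if counts is None:
        counts = Counter()
    if is_leaf(tree):
        return counts
    label, children = tree
    if label == "VP":
        # Index of the first child whose tag contains a verb marker.
        idx = next((i for i, c in enumerate(children)
                    if is_leaf(c) and any(m in c[0] for m in VERB_MARKERS)),
                   None)
        if idx is not None:  # skip VPs with no recognizable verb
            siblings = frozenset(c[0] for i, c in enumerate(children)
                                 if i != idx)
            counts[(children[idx][1], siblings)] += 1
    for child in children:
        verb_frames(child, counts)  # nested VPs are visited as well
    return counts

tree = ("S", [("VP", [
    ("PV", "qAl"),
    ("NP-SBJ", [("NOUN", "x")]),
    ("SBAR", [("S", [("VP", [("IV", "kAn"),
                             ("NP-PRD", [("NOUN", "y")])])])]),
])])
counts = verb_frames(tree)
```

Sibling labels are stored as a set, matching the description in the text, so the order and multiplicity of constituents are not recorded.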

On its own, filtering out a large number of sen- 
tences is not a problem; that we include a nontriv- 
ial fraction of the sentences at all is promising. To 
improve coverage, we could simply feed the sys- 
tem more unannotated text. For Arabic, we could 
use the very large supply of Arabic news available 
on the web. This approach would be less plausi- 
ble for Quechua, although of course unannotated 
Quechua text is more plentiful than Quechua tree- 
banks. 

4.1 Sentence Selection in Practice 

We might wonder, however, whether our sampling 
of sentences leads to biases in the observed verbs 
and their usages. Considering English verbs such 
as "think", "believe", or "request", which usually 
occur with some clausal argument that includes 
some other verb, we imagine that the analogous 
verbs in the language we are investigating would 
be under-represented or simply not learned at all. 

Experiments showed that both of these worries 
are well-founded: sentences that had only one 
verb had a substantially different distribution of 
verb stems. The verbs that were most common 
in the one-verb sentences were "saj~al", "qAl",
">aEolan", "$Arik", "kAn", "daEA", "fAz", 
"balag", ">aHoraz", and "lotaqiy". These are 
glossed by Google Translate as: record, said, an- 
nounced, was, called, beat, was, made, and assess, 
definitely a different sort of verb than the ones we
see commonly in the text generally.
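
One simple way to quantify the difference between the two lists is the overlap of the top-ten verb stems. The sketch below uses the stems reported in the text; the choice of Jaccard overlap as the measure is ours for illustration and was not part of the experiments, and two stems are written with "~", the Buckwalter symbol for gemination, which is an assumption about the intended transliteration.

```python
# Top-ten verb stems as listed in the text (Buckwalter transliteration).
top_all = ["kAn", "qAl", ">aEolan", ">aDAf", ">ak~ad",
           "kuwn", ">awoDaH", "*akar", "mokin", ">afAd"]
top_one_verb = ["saj~al", "qAl", ">aEolan", "$Arik", "kAn",
                "daEA", "fAz", "balag", ">aHoraz", "lotaqiy"]

def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

shared = set(top_all) & set(top_one_verb)
overlap = jaccard(top_all, top_one_verb)
```

Only three stems ("kAn", "qAl", and ">aEolan") appear in both lists, giving an overlap of 3/17.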

Even among the verbs that occur commonly
in both the text in general and the one-verb sen-
tences, we observe different usages. The most
common verb in general, "kAn" (was), occurs 
with another verb phrase as an argument about 
400 times in the treebank. The second-most com-
mon verb in general and in the one-verb sentences,
"qAl" ("declared/said"), most often occurs with an
SBAR (indicating a nested clause) in general, but
of course these usages do not occur in the one-verb
sentences.

4.2 Morphological ambiguity 

In experiments with the nearly-unannotated text
distributed with the treebank, we made use of 
AraMorph (Brihaye, 2004), a Free Software ver- 
sion of the Buckwalter morphological analyzer 
that handles Unicode text. The goal with the unan- 
notated text was to see which subcategory frames 
we could observe in sentences with a single verb 
- the rich Arabic morphology usually marks case 
on nouns, which should allow us to find many ar- 
guments to verbs. We could then compare the va- 
lencies learned from the unannotated corpus with 
those that are more easily observable from the 
treebank. If the valencies that we discover with the 
unannotated approach are close to those learned 
from the treebank, and we get a broad cover- 
age over the verbs observed in the corpus, then 
this would provide an argument that the technique 
works fairly well for Arabic, and we could con- 
tinue using it as we acquire more textual data for 
more under-resourced languages. 

However, Buckwalter-style morphological ana- 
lyzers do not account for morphological ambigu- 
ity, which would present difficulties in the long 
run. This problem is particularly dire because the 
Arabic script is an abjad, which is to say that it 
only records the consonants for each word. There 
is an optional system for annotating the vowel 
sounds as well, but it is often not used in prac- 
tice. This presents a problem for Arabic-language 
NLP systems, though, since a word without con- 
text rarely has a unique morphological analysis. 
In fact, within the corpus, we observed a mean of
about 7.5 possible Buckwalter analyses per word,
with σ = 8.4, and a maximum of 86.

Figure 3: Filtering results on sentences from the Penn Arabic Treebank
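
Ambiguity statistics of this kind can be computed directly from the analyzer output. The sketch below assumes the analyses are available as one list per token; the toy input is invented, and the use of the population standard deviation is also an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, pstdev

def ambiguity_stats(analyses_per_token):
    """Return (mean, standard deviation, max) of analyses per token."""
    counts = [len(analyses) for analyses in analyses_per_token]
    return mean(counts), pstdev(counts), max(counts)

# Toy input: three tokens with 1, 3, and 2 candidate analyses.
toy = [["a1"], ["b1", "b2", "b3"], ["c1", "c2"]]
m, sd, mx = ambiguity_stats(toy)
```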

We also briefly experimented with a Quechua 
morphological analyzer, Michael Gasser's 
AntiMorfo system (2010); it can analyze 
Quechua verbs, nouns, and adjectives. We also 
see morphological ambiguity in Quechua words, 
although it is not nearly so striking, largely due 
to the orthography with vowels. We saw a mean 
of 1.7 analyses per word, with σ = 1.3, and
a maximum of 10. For Quechua, we used a 
small corpus produced by the AVENUE project, 
described in (Monson et al., 2006), which in-
cludes bitext elicited from native speakers, and
monolingual text, both from UN documents and 
local stories. Interestingly, with the FST-based 
morphological analysis, we cannot tell whether a 
word is definitely a verb. For example, waqaychu 
may be a verb, noun, or infinitive, as "waqa" 
is in AntiMorfo's lexicon as a verb root, and 
"waqay" as a noun root. So to get good results 
for Quechua, we would need a POS tagger or 
other means to choose between analyses. As
for treebanks of Quechua with which to evaluate an
automatically extracted Quechua verb lexicon,
to our knowledge there is no large one available,
although Rios et al. are developing a small one
(2009).

2 The word with 86 Buckwalter analyses can mean, at
least, "one", "and scrutinize", "and sharpen", or "and be fu-
rious".

5 Conclusions and future work

While the problem of discovering grammatical
subcategories of verbs remains interesting, and so-
lutions to that problem would be of practical use,
it is almost certainly not enough to use only a
morphological analyzer and a medium-sized unan-
notated corpus for this purpose; disregarding sen-
tences with more than one verb leads to a mislead-
ing view of the language as a whole, and selectional
preferences learned from these sentences would
not be suitable for parsing most sentences. To
make better use of the existing data, it would be
helpful to have a chunker to find the boundaries of
clauses and noun phrases, as in (Przepiorkowski,
2009). For Czech verbs, Bojar similarly used
finite-state rules to find coordinated and subordi-
nate clauses (2003).



While ambiguity in morphological analysis can 
be a hurdle in any language, the vowel-free na- 
ture of typical Arabic text presents a particularly 
serious one; we would like to be able to use any 
available unannotated text, but without morpho- 
logical disambiguation, we are left with many pos- 
sible interpretations for most tokens. This could 
be mitigated with software like MADA+TOKAN 
(2009), which chooses the most likely morpholog- 
ical analysis for a given context; for other lan- 
guages, such as Quechua, similar tools will need 
to be developed. 

Even with proper POS tagging, morphological 
disambiguation, and chunking, we are still faced 
with the problem of negative evidence; without a 
very large corpus, we cannot say with confidence 
that a given verb cannot appear with certain ar- 
guments, simply because it has not yet been ob- 
served with those arguments. This difficulty sug- 
gests an active learning approach, perhaps coupled 
with crowdsourcing. We could imagine a system 
that generates sentences to test the hypothesis that 
a given verb may be used with a given subcate- 
gorization frame, then presents those sentences to 
human users for grammaticality judgments. 